A thesaurus-based statistical language model for broadcast news transcription
Abstract
This paper describes a thesaurus-based class n-gram model for broadcast news transcription. The most important issue for class n-gram models is how to construct the word classification. We construct a word-class mapping based on a thesaurus so as to maximize the average mutual information on a training corpus. To examine the effectiveness of the new method, we compare it with two of our previous methods, which use the same thesaurus but determine the word-class mappings in different ways. The new method achieved substantially lower perplexity on 83 news transcription sentences broadcast on June 4, 1996.
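The selection criterion named in the abstract can be illustrated with a small sketch. The abstract does not give the formula, so the code below assumes the standard definition of average mutual information between adjacent classes, I = sum over (c1, c2) of p(c1, c2) * log2(p(c1, c2) / (p(c1) p(c2))), estimated from class bigram counts. The function name, the toy corpus, and the class labels are all hypothetical, and this is not the authors' implementation.

```python
import math
from collections import Counter


def average_mutual_information(tokens, word_to_class):
    """Average mutual information (in bits) between adjacent classes.

    Assumes the standard class-bigram definition; probabilities are
    relative frequencies of class bigrams in `tokens`.
    """
    classes = [word_to_class[w] for w in tokens]
    bigrams = Counter(zip(classes, classes[1:]))
    total = sum(bigrams.values())

    # Marginal counts of the first and second class in each bigram.
    left, right = Counter(), Counter()
    for (c1, c2), n in bigrams.items():
        left[c1] += n
        right[c2] += n

    ami = 0.0
    for (c1, c2), n in bigrams.items():
        p12 = n / total
        # p12 / (p1 * p2) written with counts: n * total / (left * right)
        ami += p12 * math.log2(n * total / (left[c1] * right[c2]))
    return ami


# Hypothetical toy corpus and two candidate word-class mappings.
tokens = "the cat sat the dog sat the cat ran".split()

thesaurus_like = {"the": "DET", "cat": "ANIMAL", "dog": "ANIMAL",
                  "sat": "VERB", "ran": "VERB"}
single_class = {w: "ALL" for w in set(tokens)}

ami_thesaurus = average_mutual_information(tokens, thesaurus_like)
ami_single = average_mutual_information(tokens, single_class)
```

Under this definition, a mapping that puts every word in one class carries no information about the next class, while a mapping that groups words by meaning scores higher on this toy corpus. A search over thesaurus-consistent mappings would keep the candidate with the highest score.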
Similar resources
A system for the retrieval of Italian broadcast news
This paper presents a prototype for the retrieval of Italian broadcast news, which has been developed at ITC-irst. The architecture employs a speech recognition engine for the automatic transcription of audio news. Moreover, it features document indexing based on part-of-speech tagging of text coupled with morphological analysis, and query expansion exploiting the Italian WordNet thesaurus. Que...
Applying a Grammar-Based Language Model to a Simplified Broadcast-News Transcription Task
We propose a language model based on a precise, linguistically motivated grammar (a hand-crafted Head-driven Phrase Structure Grammar) and a statistical model estimating the probability of a parse tree. The language model is applied by means of an N-best rescoring step, which allows us to directly measure the performance gains relative to the baseline system without rescoring. To demonstrate that ...
Online Temporal Language Model Adaptation for a Thai Broadcast News Transcription System
This paper investigates the effectiveness of online temporal language model adaptation when applied to a Thai broadcast news transcription task. Our adaptation scheme works as follows: first, an initial language model is trained with the broadcast news transcriptions available during the development period. Then the language model is adapted over time with more recent broadcast news transcription and ...
Advances in automatic transcription of Italian broadcast news
This paper presents some recent improvements in automatic transcription of Italian broadcast news obtained at ITC-irst. A first preliminary activity was carried out in order to develop a suitable speech corpus for the Italian language. The resulting corpus, formed by recordings covering 30 hours of radio news, was exploited for developing a baseline system for transcription of broadcast news. Th...
Language modeling for automatic Turkish broadcast news transcription
The aim of this study is to develop a speech recognition system for Turkish broadcast news. State-of-the-art speech recognition systems utilize statistical models. A large amount of data is required to reliably estimate these models. For this study, a large Turkish Broadcast News database, consisting of the speech signal and corresponding transcriptions, is being collected. In this paper, infor...